Similarity Testing Between Heterogeneous Basket Datasets
Authors
Abstract
This paper proposes a new measure of similarity between basket datasets. The measure is calculated from support counts using a formula inspired by information entropy. Experiments on both real and synthetic datasets show the effectiveness of the measure. The paper also studies the problem of finding a mapping between the attribute sets of categorical databases using similarity measures, and proposes a generic approach for identifying such a mapping. The approach is implemented on top of the proposed similarity measure, and its performance has been evaluated and validated. Finally, the paper explores applications of the similarity measure to mining distributed datasets. The project is supported in part by NIH/PHS Grant 5-P41-RR09283, NIH grant RO1-AG18231, NIH grant P30AG18254, NSF Grant DUE-9980943, NSF grant EIA-0080124, NSF grant EIA-0205061 and NSF CCR-9701911.
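The abstract does not state the formula itself, so the following is only an illustrative sketch of the general idea: compute itemset supports in two basket datasets and compare them with an entropy-based score. The function names (`supports`, `similarity`), the `max_size` parameter, and the choice of a per-itemset Jensen-Shannon divergence are all assumptions made for this example; they are not the measure proposed in the paper.

```python
from itertools import combinations
from math import log2


def supports(baskets, max_size=2):
    """Relative support of every itemset with at most max_size items."""
    counts = {}
    for basket in baskets:
        items = sorted(set(basket))
        for k in range(1, max_size + 1):
            for itemset in combinations(items, k):
                counts[itemset] = counts.get(itemset, 0) + 1
    n = len(baskets)
    return {itemset: c / n for itemset, c in counts.items()}


def _binary_entropy(p):
    """Entropy in bits of a Bernoulli variable with success probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))


def similarity(baskets_a, baskets_b, max_size=2):
    """Hypothetical score in [0, 1]; 1.0 means identical supports.

    Assumption: each itemset is treated as a Bernoulli event whose
    probability is its support, and the two datasets are compared by the
    Jensen-Shannon divergence of those events, averaged over itemsets.
    """
    sa = supports(baskets_a, max_size)
    sb = supports(baskets_b, max_size)
    itemsets = set(sa) | set(sb)
    if not itemsets:
        return 1.0
    total = 0.0
    for itemset in itemsets:
        p = sa.get(itemset, 0.0)
        q = sb.get(itemset, 0.0)
        m = (p + q) / 2
        # Jensen-Shannon divergence between Bernoulli(p) and Bernoulli(q)
        total += _binary_entropy(m) - (_binary_entropy(p) + _binary_entropy(q)) / 2
    return 1.0 - total / len(itemsets)
```

For example, `similarity(a, a)` on any non-empty dataset `a` yields 1.0, while two datasets that share no items score 0.0. Capping `max_size` keeps the number of enumerated itemsets manageable; a real implementation would more likely restrict itself to frequent itemsets.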
Similar resources
Association-based similarity testing and its applications
This paper proposes a new similarity measure between basket datasets based on associations. The new measure is calculated from support counts using a formula inspired by information entropy. Experiments on both real and synthetic datasets show the effectiveness of the measure. This paper then investigates the applications of the similarity measure. It first studies the problem of finding a mapp...
An Empirical Comparison of Distance Measures for Multivariate Time Series Clustering
Multivariate time series (MTS) data are ubiquitous in science and daily life, and measuring their similarity is a core part of the MTS analysis process. Many of the research efforts in this context have focused on proposing novel similarity measures for the underlying data. However, with the countless techniques to estimate similarity between MTS, this field suffers from a lack of comparative...
Domain-Specific Ontology Mapping by Corpus-Based Semantic Similarity
Mapping heterogeneous ontologies is usually performed manually by domain experts, or accomplished by computer programs via comparing the structures of the ontologies and the linguistic semantics of their concepts. In this work, we take a different approach to compare and map the concepts of heterogeneous domain-specific ontologies by using a document corpus in a domain similar to the domain of ...
Riemannian Similarity Learning
We consider a similarity-score based paradigm to address scenarios where either the class labels are only partially revealed during learning, or the training and testing data are drawn from heterogeneous sources. The learning problem is subsequently formulated as optimization over a bilinear form of fixed rank. Our paradigm bears similarity to metric learning, where the major difference lies in...
Avgsim: Relevance Measurement on Massive Data in Heterogeneous Networks
A heterogeneous information network includes multiple types of objects and multiple types of links. Compared with a homogeneous information network, which contains only objects of the same type, a heterogeneous information network carries richer semantic information. Heterogeneous information networks are very common in daily life, for example social networks. Similarity search in heterogeneous infor...